Detecting Hidden Passages in Documents
Abstract
Passages can be hidden within a text to circumvent restrictions on their transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. We present our methodology to detect such hidden passages within a document. A document is divided into passages using various document splitting techniques, and a text classifier is used to classify each passage. Our detection rate, as shown empirically, is 76%, with equivalent precision. We provide a comparison of various passage identification methods and also evaluate the effects of passage length and feature selection in this process.
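The two-step pipeline described in the abstract (split the document into passages, then classify each passage) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which splitting techniques or which classifier were used, so the overlapping word windows, the small Naive Bayes model, and all names here (`split_into_windows`, `TinyNaiveBayes`, `detect_hidden_passages`) are hypothetical stand-ins, not the authors' method.

```python
import math
from collections import Counter


def split_into_windows(text, size=20, step=10):
    """Split text into overlapping windows of `size` words, advancing by `step`.

    Assumption: fixed-size overlapping windows are just one possible
    document splitting technique; the source does not name its own.
    """
    words = text.split()
    if len(words) <= size:
        return [" ".join(words)] if words else []
    windows = []
    # The range bound guarantees that the last window reaches the final word.
    for start in range(0, len(words) - size + step, step):
        chunk = words[start:start + size]
        if chunk:
            windows.append(" ".join(chunk))
    return windows


class TinyNaiveBayes:
    """Minimal multinomial Naive Bayes with add-one smoothing (illustrative)."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training texts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = [t for t in text.lower().split() if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def detect_hidden_passages(document, classifier, target_label, size=20, step=10):
    """Return (window_index, window_text) pairs classified as `target_label`."""
    windows = split_into_windows(document, size, step)
    return [(i, w) for i, w in enumerate(windows)
            if classifier.predict(w) == target_label]
```

To use the sketch, one would train the classifier on passages labeled as restricted or benign and then call `detect_hidden_passages` on a candidate document. The `size` and `step` parameters correspond loosely to the passage length whose effect the abstract says was evaluated.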
Similar resources
Passage detection using text classification
Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of blocks of text, namely passages, comprising a document. ...
Automatic External Plagiarism Detection Using Passage Similarities - Lab Report for PAN at CLEF 2010
In this paper, we report our approach in detecting external plagiarism. For the pre-processing stage, we identify non-English documents and translate them into English using an online translator tool. Then we index and retrieve the top documents that are similar to the suspicious documents. We divide the retrieved documents into passages where each passage contains twenty sentences. The plagiar...
Extracting Relevant Snippets for Web Navigation
Search engines present fixed-length passages from documents ranked by relevance against the query. In this paper, we present and compare novel, language-model based methods for extracting variable-length document snippets by real-time processing of documents using the query issued by the user. With this extra level of information, the returned snippets are considerably more informative. Unlike pr...
Detecting Short Passages of Similar Text in Large Document Collections
This paper presents a statistical method for fingerprinting text. In a large collection of independently written documents, each text is associated with a fingerprint which should be different from all the others. If fingerprints are too close, then it is suspected that passages of copied or similar text occur in two documents. Our method exploits the characteristic distribution of word trigrams,...
HMM-based Passage Models for Document Classification and Ranking
We present an application of Hidden Markov Models to supervised document classification and ranking. We consider a family of models that take into account the fact that relevant documents may contain irrelevant passages; the originality of the model is that it does not explicitly segment documents but rather considers all possible segmentations in its final score. This model generalizes the mul...